2.3.0 rc #5114

Merged
NathanFlurry merged 28 commits into main from 2.3.0-rc
Jun 2, 2026

Conversation

@NathanFlurry
Member

  • fix(rivetkit): exit pid1 after signal shutdown
  • fix(rivetkit): use engine actor stop threshold for shutdown
  • test(depot-client): stale vfs cache reads fail closed
  • test(depot-client): head fence read poisons vfs
  • test(depot-client): vfs stale page cache writer
  • test(depot-client): delayed read ahead stale pages
  • test(depot-client): startup preload stale pages
  • test(rivetkit-core): sqlite lifecycle fuzz harness
  • chore(kitchen-sink): agent load test
  • test(depot-client): batch atomic cap repro
  • test(depot-client): warm pidx stale read rmw repro
  • test(depot-client): natural warm pidx repro
  • test(depot-client): natural reopen warm pidx repro
  • [SLOP(claude-opus-4-7)] feat(envoy-client): add observability metrics for ws transport and sqlite request lifecycle
  • [SLOP(claude-opus-4-7)] fix(envoy-client): emit Stopped(Error) on lost-timeout to prevent silent destroy
  • Fix actor lost on envoy-client
  • DO NOT MERGE: serverless restart race condition
  • fix(rivetkit): use engine actor stop threshold for shutdown
  • test(kitchen-sink): sigterm sleep probe fixtures
  • feat(kitchen-sink): rust counter-latency harness
  • chore(kitchen-sink): refresh bench + smoke scripts
  • chore(kitchen-sink): counter actor + sigterm probe tweaks
  • chore(envoy-client): trace websocket backpressure
  • feat(envoy-client): add EnvoyStatusHandle wrapper
  • feat(rivetkit-core): wire EnvoyStatusHandle into dispatcher
  • feat(rivetkit-core): expose envoy status through /metrics
  • feat(rivetkit-napi): expose actorStopThresholdMs + envoy-aware health/metrics
  • feat(rivetkit-core): record connection close reason + lifetime metrics
  • feat(kitchen-sink): ws-ping fast-path on tunnel-stress + load-test-agent
  • Add debugging
  • fix(pegboard): add actor-scoped generation key for sqlite fencing
  • Revert "fix(pegboard): add actor-scoped generation key for sqlite fencing"
  • Cargo fmt
  • Fix actor generation validation for sqlite
  • [SLOP(claude-sonnet-4-5)] feat(metrics): add envoy lifecycle, stop reason, ws traffic, and js runtime metrics
  • [SLOP(claude-sonnet-4-5)] chore(logs): promote actor stop logs to info
  • [SLOP(claude-sonnet-4-5)] chore(logs): improve actor stop and envoy ping diagnostics
  • Remove slop
  • chore(kitchen-sink): add rivet cloud deploy workflow
  • [SLOP(gpt-5)] fix(rivetkit): reject comma-joined serverless endpoint header
  • [SLOP(gpt-5)] fix(rivetkit): disable cached serverless envoy by default
  • [SLOP(gpt-5)] fix(rivetkit): warn on cached serverless envoy regional mismatch
  • [SLOP(gpt-5)] docs(rivetkit): record performance audit notes
  • [SLOP(gpt-5)] test(envoy-client): update SharedContext fixtures for websocket diagnostics
  • [RIVETER(rivetkit-perf-fixes-4lv24k3r,[SVC-2555] Set up issue templates #1,gpt-5.5)] chore: perf(envoy-client): convert StdMutex SharedContext fields to scc
  • chore(kitchen-sink): update deployment diagnostics wiring
  • [RIVETER(rivetkit-perf-fixes-4lv24k3r,[SVC-2479] Send cluster events to PostHog #2,gpt-5.5)] chore: perf(envoy-client): replace ws_tx tokio Mutex with ArcSwapOption on hot path
  • [RIVETER(rivetkit-perf-fixes-4lv24k3r,[SVC-2504] Fix 5 GB upload limit for local development from Cloudflare #3,gpt-5.5)] chore: perf(envoy-client): replace BufferMap String keys with u64/[u8;8]
  • [RIVETER(rivetkit-perf-fixes-4lv24k3r,[SVC-2483] Remove hardcoded uses of rivet.gg #4,gpt-5.5)] chore: perf(rivetkit-core): sample record_inbox_depths instead of every loop iteration
  • [RIVETER(rivetkit-perf-fixes-4lv24k3r,[SVC-2358] Enable Redis memory overcommit #5,gpt-5.5)] chore: fix(rivetkit): repair setInterval missing-delay bug in actor-conn keepalive
  • perf(rivetkit-core): tighten queue_metadata lock around enqueue
  • perf(rivetkit-core, envoy-client): convert scc sync methods to async in async contexts
  • perf(envoy-client, guard): enable TCP_NODELAY by default + expose ws_tx_depth metric
  • Add gradual shutdown for load test
  • Fix actor stopped restart

@railway-app railway-app Bot temporarily deployed to rivet-frontend / rivet-pr-5114 May 28, 2026 03:26 Destroyed
@railway-app

railway-app Bot commented May 28, 2026

🚅 Deployed to the rivet-pr-5114 environment in rivet-frontend

| Service | Status | Web | Updated (UTC) |
| --- | --- | --- | --- |
| ladle | ◻️ Removed (View Logs) | Web | Jun 2, 2026 at 8:04 pm |
| frontend-inspector | ◻️ Removed (View Logs) | Web | Jun 2, 2026 at 8:04 pm |
| website | ◻️ Removed (View Logs) | Web | Jun 2, 2026 at 8:04 pm |
| frontend-cloud | ❌ Build Failed (View Logs) | Web | Jun 2, 2026 at 8:04 pm |
| kitchen-sink | ❌ Build Failed (View Logs) | Web | Jun 2, 2026 at 8:04 pm |
| mcp-hub | 🕒 Building (View Logs) | Web | Jun 2, 2026 at 8:04 pm |

@claude
Contributor

claude Bot commented May 28, 2026

Code Review: PR #5114 — 2.3.0 RC

This is a large release candidate PR bundling multiple changes. I'll focus on the substantive Rust changes since the TypeScript SDK regeneration and image deletions are mechanical.


Critical Issues

1. vfs_retain_read_cache missing from SqliteOptimizationFlags — likely compile error

File: engine/packages/depot-client/src/vfs.rs and engine/packages/depot-client/src/optimization_flags.rs

VfsConfig::from_optimization_flags references flags.vfs_retain_read_cache, but SqliteOptimizationFlags does not appear to include this field. The VFS inline test at tests/inline/vfs.rs also references vfs_retain_read_cache: true. Either optimization_flags.rs was not updated in this PR or the code will not compile as-is.

Recommendation: Add pub vfs_retain_read_cache: bool to SqliteOptimizationFlags with a false default, along with environment-variable parsing consistent with the other vfs_* flags.
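
A minimal sketch of that recommendation follows. Only the field name vfs_retain_read_cache and its false default come from the review; the environment variable name, the from_lookup helper, and the accepted value spellings are assumptions for illustration, and the real struct carries other vfs_* flags that are omitted here.

```rust
/// Hypothetical, trimmed-down flags struct: only the field discussed above is shown.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct SqliteOptimizationFlags {
    pub vfs_retain_read_cache: bool,
}

impl Default for SqliteOptimizationFlags {
    fn default() -> Self {
        // The review asks for a `false` default.
        Self { vfs_retain_read_cache: false }
    }
}

/// Parses a boolean flag, falling back to `default` when unset or unrecognized.
fn parse_bool_flag(raw: Option<&str>, default: bool) -> bool {
    let normalized = raw.map(|value| value.trim().to_ascii_lowercase());
    match normalized.as_deref() {
        Some("1") | Some("true") => true,
        Some("0") | Some("false") => false,
        _ => default,
    }
}

impl SqliteOptimizationFlags {
    /// `lookup` stands in for the environment so parsing can be tested
    /// without mutating process-wide state.
    pub fn from_lookup(lookup: impl Fn(&str) -> Option<String>) -> Self {
        let defaults = Self::default();
        Self {
            // Assumed variable name; the real naming scheme is not shown in this PR page.
            vfs_retain_read_cache: parse_bool_flag(
                lookup("SQLITE_OPT_VFS_RETAIN_READ_CACHE").as_deref(),
                defaults.vfs_retain_read_cache,
            ),
        }
    }

    /// Reads the flag from the real process environment.
    pub fn from_env() -> Self {
        Self::from_lookup(|key| std::env::var(key).ok())
    }
}
```

Routing the parser through a lookup closure keeps the default-handling logic testable without touching the process environment.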


2. Unbounded channels on gateway hot paths remove all backpressure

Files: engine/packages/pegboard-gateway/src/shared_state.rs, engine/packages/pegboard-gateway2/src/shared_state.rs, engine/packages/universaldb/src/driver/postgres/transaction.rs, engine/packages/universaldb/src/driver/rocksdb/transaction.rs

All four files replace bounded mpsc::channel(128)/mpsc::channel(100) with mpsc::unbounded_channel(). While this eliminates .await on the send side, it removes the only backpressure mechanism on WebSocket-to-tunnel message forwarding and database transaction command paths. A stalled consumer will now grow the queue without bound.

The CLAUDE.md guideline is to use try_reserve helpers and return actor.overloaded rather than awaiting bounded sends. The correct fix is not unbounded channels but try_send with an overload error on full.

Additionally, forward_tunnel_message in pegboard-gateway2/src/shared_state.rs is still marked async fn after removing the .await — it can be made synchronous.

Recommendation: Either revert to bounded channels with try_send (dropping/logging on full), or document an explicit capacity bound and pair each unbounded channel with a queue-depth metric.
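
The bounded-channel option might look roughly like the sketch below. To stay dependency-free it uses std::sync::mpsc::sync_channel as a stand-in for the tokio channel in the gateway; tokio's bounded Sender also offers a non-blocking try_send that distinguishes a full channel from a closed one. TunnelSender and ForwardError are hypothetical names, not types from this PR.

```rust
use std::sync::mpsc::{sync_channel, Receiver, SyncSender, TrySendError};

/// Hypothetical error type: `Overloaded` plays the role of an
/// `actor.overloaded`-style error returned to the caller.
#[derive(Debug, PartialEq, Eq)]
pub enum ForwardError {
    Overloaded,
    Closed,
}

/// Hypothetical wrapper around the sending half of a bounded queue.
pub struct TunnelSender {
    tx: SyncSender<Vec<u8>>,
}

impl TunnelSender {
    /// `capacity` should be at least 1; a capacity of 0 makes `sync_channel`
    /// a rendezvous channel, so every `try_send` without a waiting receiver fails.
    pub fn new(capacity: usize) -> (Self, Receiver<Vec<u8>>) {
        let (tx, rx) = sync_channel(capacity);
        (Self { tx }, rx)
    }

    /// Never blocks: a full queue is reported instead of awaited or grown.
    pub fn forward(&self, message: Vec<u8>) -> Result<(), ForwardError> {
        match self.tx.try_send(message) {
            Ok(()) => Ok(()),
            Err(TrySendError::Full(_)) => Err(ForwardError::Overloaded),
            Err(TrySendError::Disconnected(_)) => Err(ForwardError::Closed),
        }
    }
}
```

This keeps the send side synchronous, which was the stated motivation for the unbounded channels, while a stalled consumer surfaces as an explicit overload error rather than unbounded memory growth.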


Moderate Issues

3. u32/u64 generation type inconsistency

Files: engine/packages/pegboard/src/keys/actor.rs, engine/packages/pegboard/src/workflows/actor2/runtime.rs, engine/packages/pegboard-envoy/src/ws_to_tunnel_task.rs

  • GenerationKey::Value is u32
  • AllocateInput.generation is u32
  • validate_remote_sqlite_generation takes generation: u64 and does u32::try_from(generation)

The u64 → u32 narrowing will fail for values above u32::MAX with the confusing error "invalid sqlite actor generation" rather than a generation fence error. Either store generation as u64 throughout, or narrow the validation function signature to u32 after confirming the protocol always sends in-range values.
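
One way to keep the two failure modes distinguishable is sketched below, assuming the stored generation stays u32 and the protocol value stays u64. GenerationError and validate_generation are illustrative names; the real validate_remote_sqlite_generation has a different signature and reads the expected value from storage.

```rust
use std::convert::TryFrom;

/// Hypothetical error type separating "value cannot be a generation at all"
/// from "value is a valid generation but not the current one".
#[derive(Debug, PartialEq, Eq)]
pub enum GenerationError {
    OutOfRange { received: u64 },
    FenceMismatch { expected: u32, received: u32 },
}

/// Narrows the wire value to the stored width, then applies the fence check.
pub fn validate_generation(expected: u32, received: u64) -> Result<u32, GenerationError> {
    let narrowed = u32::try_from(received)
        .map_err(|_| GenerationError::OutOfRange { received })?;
    if narrowed != expected {
        return Err(GenerationError::FenceMismatch { expected, received: narrowed });
    }
    Ok(narrowed)
}
```

With this shape, an out-of-range value is reported as such, and a stale-but-valid generation is reported as a fence error.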

4. "actor does not exist" string match overlaps with generation fence errors

File: engine/packages/depot-client/src/vfs.rs

fn is_initial_main_page_missing(message: &str) -> bool {
    message.contains("sqlite database was not found in this bucket branch")
        || message.contains("sqlite meta missing for get_pages")
        || message == "actor does not exist"
}

"actor does not exist" is also emitted by validate_remote_sqlite_generation on a generation fence mismatch. If a generation fence failure during the startup preload is mistakenly classified as a "new database" case, the VFS will silently open with an empty database rather than propagating the fence error. Use a distinct sentinel string for the startup-miss case, or match on a typed error variant instead of the message string.

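The typed-variant alternative could be sketched as follows. PreloadError and its variants are hypothetical; the PR's actual error types are not visible on this page. The point is only that the startup-miss check matches on variants, so a fence failure can never be mistaken for a new database.

```rust
/// Hypothetical typed errors for the startup preload path.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum PreloadError {
    /// The database has not been created in this bucket branch yet.
    DatabaseNotFound,
    /// Page metadata is missing for the requested pages.
    MetaMissing,
    /// The actor has no stored state yet (a genuinely new actor).
    ActorNotFound,
    /// The caller's generation does not match the current one.
    GenerationFenced,
    /// Any other failure, carried as text.
    Other(String),
}

/// Only "nothing has been written yet" cases count as a startup miss.
/// `GenerationFenced` and `Other` must propagate to the caller.
pub fn is_initial_main_page_missing(error: &PreloadError) -> bool {
    matches!(
        error,
        PreloadError::DatabaseNotFound
            | PreloadError::MetaMissing
            | PreloadError::ActorNotFound
    )
}
```

If a typed variant is not feasible across the transport boundary, a distinct sentinel string for the fence case gives the same separation with less type safety.
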
5. Non-deterministic repro tests and eprintln! violations

File: engine/packages/depot-client/tests/inline/vfs.rs

warm_pidx_stale_read_then_rmw_commit_produces_malformed_db and the natural repro variant acknowledge in comments that they may not trigger on every run. Tests that pass regardless of whether the bug is present are not regression guards. Additionally, both tests use eprintln! throughout, which violates the CLAUDE.md rule against eprintln!/println! in Rust code — use tracing::warn! with structured fields.

Recommendation: Either mark these #[ignore] with a comment explaining they are repro attempts, or convert them to deterministic tests by injecting stale bytes directly through the test transport. Remove all eprintln! calls.


Minor Issues

6. High-frequency debug → info log promotions

File: engine/packages/depot-client/src/vfs.rs

Several calls were promoted from tracing::debug! to tracing::info!:

  • "vfs get_pages fetch" — emitted on every page fetch
  • "sqlite initial page preload request/result" — emitted on every actor open
  • "sqlite vfs close summary" and "vfs commit"

At info level, these will produce significant log volume on busy instances and risk obscuring actionable entries. These belong at debug level, or should only be emitted when values are out of expected range.

7. Metric renames break existing dashboards and alerts

File: engine/packages/pegboard-envoy/src/metrics.rs

All 12 metrics are renamed from the pegboard_envoy_* prefix to envoy_* (e.g., pegboard_envoy_connection_total → envoy_connection_total). This is a breaking change for any existing Grafana dashboards or alerting rules referencing the old names. Either dual-register during a deprecation window or confirm no external queries reference the old names.

8. VFS name generation parsing should warn on None

File: engine/packages/depot-client/src/vfs.rs

let generation = name
    .rsplit_once("-g")
    .and_then(|(_, generation)| generation.parse::<u64>().ok());

When parsing fails, generation silently becomes None and disappears from logs. Emit a tracing::warn! when the parse fails so format changes are detectable.
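
A sketch of a parse that keeps the failure reason is shown below. The function names and the example VFS names are assumptions; only the "-g" suffix convention comes from the snippet above. The warning itself is left as a comment because the tracing crate is not available in a standalone example.

```rust
/// Extracts the generation from a VFS name ending in "-g<number>".
/// Returns a human-readable reason on failure instead of discarding it.
pub fn parse_vfs_generation(name: &str) -> Result<u64, String> {
    let (_, suffix) = name
        .rsplit_once("-g")
        .ok_or_else(|| format!("vfs name {:?} has no -g suffix", name))?;
    suffix
        .parse::<u64>()
        .map_err(|err| format!("vfs name {:?} has bad generation {:?}: {}", name, suffix, err))
}

/// Call-site shape: the Option is still what gets attached to logs,
/// but the failure branch is now an explicit place to emit a warning.
pub fn generation_for_logs(name: &str) -> Option<u64> {
    match parse_vfs_generation(name) {
        Ok(generation) => Some(generation),
        Err(_reason) => {
            // In the real code this is where a structured tracing::warn!
            // with the name and reason fields would go.
            None
        }
    }
}
```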

9. retain_read_cache may need a #[cfg(test)] gate

File: engine/packages/depot-client/src/vfs.rs

This field appears to exist primarily to support the new stale-cache tests. If it has no production use case, gate it with #[cfg(test)] like assert_batch_atomic. If it does have a production use case, add a comment explaining the intended behavior and tradeoffs.

10. Initial allocation may skip GenerationKey write for actors without an allocation

File: engine/packages/pegboard/src/workflows/actor2/runtime.rs

The GenerationKey write in insert_state_and_db uses generation 0 for new actors. The validate_remote_sqlite_generation fallback path through the legacy ActorKey check handles actors that predate the generation key. If this migration path is intentional, document it explicitly and add a task to remove the fallback once migration is complete.


Positive Observations

  • GenerationFencedTransport is a clean, composable pattern that injects generation fencing at the transport boundary without polluting VFS internals.
  • SqliteOpenTimer as a RAII guard that always records the error metric (even on early return) is correct and idiomatic.
  • has_readable_page correctly unifies dirty-buffer, committed-cache, and page-cache checks to skip redundant read-ahead fetches.
  • The Drop impl on VfsContext emitting a close-summary diagnostic is useful for post-mortem analysis.
  • The new VFS test suite is thorough for the specific scenarios it covers (dead VFS, head fence mismatch, stale writer, delayed read-ahead, delayed startup preload).

Summary

| Priority | Issue |
| --- | --- |
| Critical | vfs_retain_read_cache missing from SqliteOptimizationFlags — likely compile error |
| Critical | Unbounded channels on gateway hot paths remove all backpressure |
| Moderate | u32/u64 generation type inconsistency across key, input, and validation |
| Moderate | "actor does not exist" string match in startup-miss overlaps with generation fence error |
| Moderate | Repro tests use eprintln! and are non-deterministic regression guards |
| Minor | Per-request debug → info log promotions will produce excessive log volume |
| Minor | Metric renames from pegboard_envoy_* → envoy_* break existing dashboards |
| Minor | VFS name generation parsing should warn on None instead of silently losing context |
| Minor | retain_read_cache may need #[cfg(test)] gate or production documentation |

@railway-app railway-app Bot temporarily deployed to rivet-frontend / rivet-pr-5114 May 28, 2026 21:33 Destroyed
@railway-app railway-app Bot temporarily deployed to rivet-frontend / rivet-pr-5114 May 28, 2026 21:40 Destroyed
@railway-app railway-app Bot temporarily deployed to rivet-frontend / rivet-pr-5114 May 28, 2026 21:42 Destroyed
@railway-app railway-app Bot temporarily deployed to rivet-frontend / rivet-pr-5114 June 2, 2026 19:26 Destroyed
@github-actions
Contributor

github-actions Bot commented Jun 2, 2026

This PR has a Rivet namespace. Learn more

| Project | Namespace | Status | Actions |
| --- | --- | --- | --- |
| rivet | - | Failed | Logs |

@railway-app railway-app Bot temporarily deployed to rivet-frontend / rivet-pr-5114 June 2, 2026 19:42 Destroyed
@NathanFlurry NathanFlurry marked this pull request as ready for review June 2, 2026 19:47
@railway-app railway-app Bot temporarily deployed to rivet-frontend / rivet-pr-5114 June 2, 2026 20:04 Destroyed
@NathanFlurry NathanFlurry merged commit da2b5fc into main Jun 2, 2026
4 of 14 checks passed
@NathanFlurry NathanFlurry deleted the 2.3.0-rc branch June 2, 2026 20:04